Skip to content

fix: redact is_sql, remove zero-variance is_mql (2/3 of #57)#58

Merged
shaypal5 merged 2 commits into
mainfrom
fix/issue-57-is-sql-is-mql
May 4, 2026
Merged

fix: redact is_sql, remove zero-variance is_mql (2/3 of #57)#58
shaypal5 merged 2 commits into
mainfrom
fix/issue-57-is-sql-is-mql

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Bundle schema v2 → v3. Closes issue #57 sub-items 2 and 3, and (after the self-review pass in commit 55e5948) closes a redaction-bypass that was already present in v2 from PR #56.

Closes 2 of 3 in #57:

  • Sub-item 2: is_sql=False near-deterministic for non-conversion → redact in student_public (now applied to both task splits and tables/leads.parquet)
  • Sub-item 3: is_mql constant True → fully remove from the schema (LeadRow, simulator, all parquets, all modes)
  • Sub-item 1: windowed-snapshot rebuild → separate PR

Bonus fix in this PR: the redaction now applies to relational tables too, closing a bypass that affected current_stage from v2 onward.

Why these changes

is_mql is constant True in the simulator. Every lead is initialised at current_stage="mql" in simulation/population.py, so is_mql == True for 100% of leads. Zero-variance — no information for modelling. Solution: fully remove from LeadRow, DTYPE_MAP, population.py, engine.py, and the canonical feature list. Gone from every artifact in every mode.

is_sql=False is near-deterministic for non-conversion. Measured across 5 seeds on full-size bundles:

Tier P(conv | is_sql=True) P(conv | is_sql=False)
intro 0.4422 ± 0.0286 0.0607 ± 0.0255
intermediate 0.2261 ± 0.0117 0.0204 ± 0.0097
advanced 0.0876 ± 0.0079 0.0109 ± 0.0042

At advanced tier, is_sql=False predicts non-conversion with ~99% probability across all measured seeds. A student trained their first model on (label = is_sql) would clear the baseline AUC for free — embarrassing on the hardest tier. Solution: tag with redact_in_modes={ExposureMode.student_public}, retain in research_instructor for DGP-aware research.

Bypass closure (added in commit 55e5948 after self-review). In v2, redaction applied only to tasks/converted_within_90_days/*.parquet. Users following the README's "Option 3" (feature engineering off the raw tables/leads.parquet) could trivially join current_stage (or now is_sql) back into their feature set by lead_id. In v3, redacted_columns_for(mode) is applied uniformly to every published parquet under both tables/ and tasks/. This also retroactively closes the same bypass that PR #56 left open for current_stage.

Why bump BUNDLE_SCHEMA_VERSION

Removing is_mql changes the canonical feature list in every exposure mode — not just a mode-conditional redaction like PR #56. Combined with the new relational-table redaction surface, downstream consumers may legitimately want to gate on the schema version.

The bump was confirmed in chat with the user before commit (not in the PR thread — apologies for the prior PR description's misleading wording).

Bundle column counts (after this PR)

task split tables/leads.parquet
student_public/{intro,intermediate,advanced} 32 (was 34) 9 (was 12)
research_instructor/intermediate_instructor 34 (was 35) 11 (was 12)

manifest.redacted_columns for student_public bundles lists ["current_stage", "is_sql"].

New automated checks

  • tests/render/test_bundle_schema_v3_contract.py (8 tests): hardcoded expected column lists per mode for both task splits and tables/leads.parquet. The duplication is the point — any future change to the feature spec, LeadRow, or redaction policy that also changes the published column set must update this contract. A bare schema-changing PR that doesn't touch this file fails here, forcing explicit version coordination.
  • test_no_zero_variance_features strengthened: alongside the strict nunique >= 2 guard, on bundles ≥ 200 rows it also asserts low-cardinality columns have rarest class ≥ 1% of rows. Skipped on tiny test fixtures.

Files changed

  • leadforge/schema/entities.py — drop is_mql from LeadRow + DTYPE_MAP.
  • leadforge/schema/features.py — drop is_mql FeatureSpec; add redact_in_modes={student_public} to is_sql.
  • leadforge/simulation/population.py, engine.py — drop is_mql=True constructor calls.
  • leadforge/api/bundle.py — apply redacted_columns_for(mode) to every relational table write before snapshot.
  • leadforge/render/manifests.py — bump BUNDLE_SCHEMA_VERSION to "3".
  • leadforge/validation/realism.py, drift.py — skip current_stage checks when the column is absent (student_public mode).
  • leadforge/validation/invariants.pycheck_exposure_monotonicity accepts legitimate column differences in tables/leads.parquet (subset of the redaction set).
  • tests/exposure/test_redaction.py — strengthen the zero-variance test.
  • tests/render/test_bundle_schema_v3_contract.py — new schema contract.
  • tests/schema/test_entities.py, tests/simulation/test_engine.py, tests/simulation/test_population.py, tests/validation/test_realism.py — drop / replace is_mql references; parameterize boolean-column choice.
  • release/README.md, release/HF_DATASET_CARD.md — table-form leakage handling section, multi-seed numbers.
  • CHANGELOG.md — describe v3 with multi-seed measurements and the relational-table redaction.
  • .agent-plan.md — strike sub-items 2 and 3 from Structural leakage in student_public bundles (post-#56 follow-up) #57.

Verification

  • pytest: 911 passing (was 902 before this PR; +9 net: +8 contract, +1 stronger zero-variance, -1 removed is_mql sim test, others updated)
  • ruff check / ruff format --check / mypy leadforge/: clean
  • All 4 release bundles regenerate and pass validate_bundle()
  • python scripts/verify_hash_determinism.py: 73/73 files identical across two consecutive builds
  • Spot check on regenerated bundles:
intro:                    task ncols=32  leads ncols=9
intermediate:             task ncols=32  leads ncols=9
advanced:                 task ncols=32  leads ncols=9
intermediate_instructor:  task ncols=34  leads ncols=11

current_stage, is_sql, is_mql all absent from every student_public artifact (task split + relational table + feature dictionary).

Out of scope

Issue #57 sub-item 1: event-aggregate features (touch_count, session_count, pricing_page_views, expected_acv, days_since_last_touch, ...) are still computed over the same 90-day window the label resolves in. The structural fix is a windowed snapshot rebuild (snapshot_day < label_window_days). That PR will likely bump the schema again and recalibrate documented conversion rates.

Test plan

  • pytest — 911 passing
  • ruff check / mypy — clean
  • All 4 release bundles regenerate and validate
  • Hash determinism — 73/73 identical
  • Manifest shows bundle_schema_version: "3" and the expected redacted_columns
  • No zero-variance feature in any student_public task split (new test)
  • No current_stage/is_sql in tables/leads.parquet for student_public (new contract test)
  • No is_mql anywhere in any bundle (new contract test)

🤖 Generated with Claude Code

Bundle schema v2 → v3.  Addresses two of the three structural-leakage
concerns surfaced in PR #56's self-review and tracked in issue #57.

**`is_mql` removed from the canonical feature list.** Every lead is
initialised at MQL stage in `simulation/population.py`, making the
column constant `True` and zero-variance — no information for modelling.
The `LeadRow.is_mql` field is retained on the relational `leads.parquet`
table; only the snapshot/task-split column and feature-dictionary row
are removed. Affects all exposure modes.

**`is_sql` redacted in `student_public` mode.** Measured on the v2
bundles: P(converted | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across
intro / intermediate / advanced — at advanced tier essentially
deterministic for the negative class.  `is_sql` is added to
`redact_in_modes={ExposureMode.student_public}`; the redaction
machinery from PR #56 picks it up automatically.  `is_sql` remains in
`research_instructor` exports.

**`BUNDLE_SCHEMA_VERSION` bumped to "3".** Removing `is_mql` changes the
canonical feature set in every mode (not just a mode-conditional
redaction), so a future external consumer pinning the v2 schema and
checking for `is_mql` would silently break.  Bumping is what the field
is for.

**New regression test.** `test_no_zero_variance_features` in
`tests/exposure/test_redaction.py` reads each published student_public
task split and asserts every non-ID, non-target column has at least
two distinct values.  Guards against the `is_mql` zero-variance
regression and any future zero-variance feature slipping in.

Bundle column counts (v3):
  - student_public/{intro,intermediate,advanced}: 32 (was 34)
  - research_instructor/intermediate_instructor: 34 (was 35)

Verification:
  - 903 tests passing (was 902; +1 zero-variance regression test)
  - ruff + mypy clean
  - All 4 release bundles regenerated; all pass `validate_bundle()`
  - `verify_hash_determinism.py` reports 73/73 identical across builds

Open follow-up: issue #57 sub-item 1 (event-aggregate features computed
over the label window) — its own PR, will likely bump the schema again.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 12:26
@shaypal5 shaypal5 added type: bugfix Fixes a bug layer: schema schema/ entity/event contracts layer: render render/ bundle and artifact output labels May 4, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the public-release feature surface to address two issue #57 leakage follow-ups: removing zero-variance is_mql from the canonical snapshot feature list and redacting is_sql from student-facing task exports, with the matching schema/version, docs, and tests updated around that change.

Changes:

  • Remove is_mql from canonical snapshot features and mark is_sql for student_public redaction.
  • Bump bundle schema version to v3 and refresh release/changelog/docs to describe the new published column set.
  • Add/adjust regression tests around redaction and realism validation inputs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/validation/test_realism.py Updates the boolean-dtype realism regression to corrupt opportunity_created instead of removed is_mql.
tests/exposure/test_redaction.py Adds a new zero-variance regression test for student-public task exports.
release/README.md Updates public release summary counts and leakage-handling documentation for schema v3.
release/HF_DATASET_CARD.md Updates Hugging Face dataset card counts and leakage notes for the public bundles.
leadforge/schema/features.py Removes is_mql from canonical snapshot features and redacts is_sql in student_public.
leadforge/render/manifests.py Bumps BUNDLE_SCHEMA_VERSION from 2 to 3 and documents schema history.
CHANGELOG.md Adds unreleased notes for schema v3, column-count changes, and follow-up status.
.agent-plan.md Marks issue #57 sub-items 2 and 3 as resolved and leaves the windowed-snapshot work open.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +156 to +164
"Whether the lead had achieved SQL status at snapshot date. "
"Strongly correlated with the label: the simulator only converts "
"non-SQL leads via a rare direct-conversion path, so "
"is_sql=False predicts non-conversion with very high probability "
"(P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across difficulty "
"tiers). Redacted from student_public bundles.",
"lead_meta",
leakage_risk=True,
redact_in_modes=frozenset({ExposureMode.student_public}),
Comment on lines +132 to 145
"""
df = pd.read_parquet(bundle / "tasks/converted_within_90_days/train.parquet")
exempt = {"account_id", "contact_id", "lead_id", "lead_created_at"}
target = next(f.name for f in LEAD_SNAPSHOT_FEATURES if f.is_target)
exempt.add(target)
for col in df.columns:
if col in exempt:
continue
assert df[col].nunique(dropna=False) >= 2, (
f"feature {col!r} has zero variance in the published "
f"student_public bundle ({df[col].nunique(dropna=False)} distinct value)"
)


Comment on lines +163 to +164
leakage_risk=True,
redact_in_modes=frozenset({ExposureMode.student_public}),
Comment thread CHANGELOG.md Outdated
Comment on lines +30 to +31
`validate_bundle()` now flags any zero-variance feature in the published
student_public task split (excluding ID columns and the target).
Comment thread release/README.md Outdated
Comment on lines +133 to +138
Three columns in the original simulator output were either label-encoding or near-deterministic for the negative class. The exposure layer handles them as follows:

**Known caveats** (see [PR #56](https://github.com/leadforge-dev/leadforge/pull/56) for the discussion):
- `current_stage` -- at the 90-day horizon, contains terminal stages (`closed_won`/`closed_lost`) that encode the label directly. **Stripped from `student_public` bundles**; available in `intermediate_instructor/`.
- `is_sql` -- `is_sql=False` predicts non-conversion with very high probability (P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across intro / intermediate / advanced). **Stripped from `student_public` bundles**; available in `intermediate_instructor/`.
- `is_mql` -- removed entirely (in all modes): every lead is initialised at MQL stage in the simulator, so the column was constant `True` and zero-variance.
- `total_touches_all` -- counts touches over the full 90-day window, including post-snapshot events. **Deliberately retained** as a pedagogical trap (flagged `leakage_risk=True` in the dictionary). Use it as an exercise: train with and without it, compare AUC, explain the gap.
Comment on lines +97 to +98
- **Stripped from public bundles:** `current_stage` (label-encoding at the 90-day horizon) and `is_sql` (P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across tiers — near-deterministic for non-conversion). Both available in `intermediate_instructor/`. The `manifest.json` field `redacted_columns` lists what was stripped.
- **Removed entirely:** `is_mql` (constant `True` in the simulator — zero variance, no information).
Addresses senior-dev review of #58.  Six fixes:

1. **Close the relational-table bypass.**  The previous commit redacted
   columns only from the snapshot/task splits.  Users following the
   README's "Option 3" (feature engineering off `tables/leads.parquet`)
   could trivially rejoin redacted columns by `account_id`/`contact_id`.
   Apply `redacted_columns_for(mode)` uniformly to every published
   parquet under both `tables/` and `tasks/`.  In student_public
   bundles, `tables/leads.parquet` no longer carries `current_stage` or
   `is_sql`.  This also closes the equivalent bypass that PR #56 left
   open for `current_stage`.

2. **Fully remove `is_mql`.**  Previously left as a half-measure: the
   field was dropped from the snapshot but kept on `LeadRow` and so
   still appeared as a constant-True column in `tables/leads.parquet`.
   Now removed from the entity dataclass, `DTYPE_MAP`,
   `population.py`, and `engine.py`.  Gone from every artifact in
   every mode.

3. **Multi-seed measurement for the is_sql claim.**  The PR description's
   single-seed numbers overstated the case ("99.4% deterministic").
   Re-measured across 5 seeds on full-size bundles:
   P(conv | is_sql=False) = 0.061 ± 0.026 (intro) / 0.020 ± 0.010
   (intermediate) / 0.011 ± 0.004 (advanced).  Pattern holds, framing
   is now honest.  CHANGELOG and release/README.md updated with the
   new numbers.

4. **Strengthen the zero-variance test.**  The strict `nunique >= 2`
   guard caught `is_mql` but missed the weaker "rarest class is 1
   row" pattern.  For low-cardinality columns on bundles ≥ 200 rows,
   additionally assert the rarest class has ≥ 1% of rows (skipped on
   tiny test fixtures where the threshold would false-positive).

5. **Add an explicit v3 schema contract.**
   `tests/render/test_bundle_schema_v3_contract.py` pins the exact
   column set per mode for both `tasks/` and `tables/leads.parquet`.
   The duplication is intentional: any future change touching the
   feature spec, LeadRow, or redaction policy must update this file
   *and* bump `BUNDLE_SCHEMA_VERSION`.  A bare schema-changing PR
   that doesn't touch this file fails here.

6. **Parameterize the realism boolean test.**  `test_detects_non_boolean
   _feature` no longer hard-codes `opportunity_created`.  It picks
   the first non-target boolean from `LEAD_SNAPSHOT_FEATURES` at test
   time so future column renames don't break it for unrelated reasons.

Validation knock-on: `_check_stage_distribution` (realism.py) and the
seed-drift `current_stage` collection (drift.py) now skip when the
column is missing — consistent with the new redaction surface.
`check_exposure_monotonicity` extended to allow legitimate column
differences in `tables/leads.parquet` between modes (subset of the
expected redaction set).

Tests:
- 911 passing (was 903; +8 v3 contract tests, +1 stronger zero-variance,
  -1 removed `is_mql` simulation test, others updated).
- ruff + mypy clean.
- All 4 release bundles regenerate and validate.
- `verify_hash_determinism.py`: 73/73 files identical across builds.

Per-bundle column counts after this commit:
  student_public: task=32, leads.parquet=9
  research_instructor: task=34, leads.parquet=11

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5

shaypal5 commented May 4, 2026

Copy link
Copy Markdown
Contributor Author

Self-review pass: 8 issues raised, 8 addressed in commit 55e5948

# Issue Action
1 Redaction bypassable via tables/leads.parquet (severe) redacted_columns_for(mode) now applies uniformly to every published parquet under both tables/ and tasks/. In student_public, tables/leads.parquet lost current_stage and is_sql. Also retroactively closes the same bypass that PR #56 left open.
2 is_mql only half-removed Fully removed: LeadRow, DTYPE_MAP, population.py, engine.py, every parquet, every mode.
3 "Near-deterministic" claim from one seed Re-measured across 5 seeds: P(conv | is_sql=False) = 0.061 ± 0.026 / 0.020 ± 0.010 / 0.011 ± 0.004. Pattern holds, framing is now honest. PR description and CHANGELOG updated.
4 Zero-variance test too lax Now asserts low-cardinality columns have rarest class ≥ 1% of rows on bundles ≥ 200 rows; strict nunique >= 2 retained for everything.
5 Schema bump without a contract test New tests/render/test_bundle_schema_v3_contract.py (8 tests) pins the exact column set per mode for tasks AND tables/leads.parquet. Future schema-changing PRs must update this file.
6 test_detects_non_boolean_feature couples to a specific column Parameterized to pick the first non-target boolean from LEAD_SNAPSHOT_FEATURES at test time.
7 Two changes in one PR Won't split — they share a schema bump and test infrastructure; splitting now would cost more than it gains. Noted as a "next time" lesson.
8 PR description claims user-confirmation in the PR thread Edited PR body — confirmation was in the dev chat before opening.

Verification after refactor:

  • 911 tests passing (was 903; +8 contract tests)
  • ruff + mypy clean
  • 4/4 release bundles regenerate and validate
  • 73/73 files hash-identical across two builds
  • tables/leads.parquet student_public: 9 cols (no current_stage, no is_sql, no is_mql)
  • tables/leads.parquet instructor: 11 cols (is_mql gone everywhere; current_stage and is_sql retained)

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #58 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/schema/features.py:164
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550070
Root author: copilot-pull-request-reviewer

Comment:
    Marking `is_sql` as redacted here only removes it from the snapshot/task exports. `write_bundle()` still publishes `tables/leads.parquet` unchanged in `student_public`, so anyone following the documented raw-table workflow can still recover `is_sql` from the public bundle and get the same near-deterministic signal this PR is trying to remove.

## COPILOT-2
Location: tests/exposure/test_redaction.py:169
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550105
Root author: copilot-pull-request-reviewer

Comment:
    This regression check only reads `train.parquet` from the default-difficulty student bundle. The release publishes three student_public tiers and three splits per tier, so a zero-variance column that only shows up in `valid`/`test` or in intro/advanced would still slip through CI.

## COPILOT-3
Location: leadforge/schema/features.py:164
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550115
Root author: copilot-pull-request-reviewer

Comment:
    There still isn't a test that pins `is_sql` as a required public-bundle redaction. The current redaction assertions derive their expected set from `redacted_columns_for()`, so if a future edit drops `is_sql` from `redact_in_modes`, the writer, validator, and tests will all change in lockstep and CI will stay green.

## COPILOT-4
Location: CHANGELOG.md
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550135
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The release notes say `validate_bundle()` now rejects zero-variance features, but no validator code was added for that. `validate_bundle()` still delegates to the existing realism checks, which do not inspect per-column variance, so this changelog entry overstates the shipped behavior.

## COPILOT-5
Location: release/README.md
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550158
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    These bullets read as if `is_sql` and `is_mql` are absent from the entire public bundle, but `student_public/tables/leads.parquet` is still written unchanged and still contains both columns. That will mislead users following the raw-table feature-engineering workflow documented earlier in this README.

## COPILOT-6
Location: release/HF_DATASET_CARD.md:98
URL: https://github.com/leadforge-dev/leadforge/pull/58#discussion_r3181550179
Root author: copilot-pull-request-reviewer

Comment:
    This section currently says `is_sql` is stripped from public bundles and `is_mql` is removed entirely, but the published `student_public/tables/leads.parquet` still contains both fields. Consumers who build features from the relational tables will get a different schema than this card promises.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25319870252 attempt 1
Comment timestamp: 2026-05-04T12:48:38.614155+00:00
PR head commit: 55e5948f3ff18f02baf3e2f781c64b6f28e34db9

@shaypal5 shaypal5 merged commit c908cf4 into main May 4, 2026
8 checks passed
@shaypal5 shaypal5 deleted the fix/issue-57-is-sql-is-mql branch May 4, 2026 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: render render/ bundle and artifact output layer: schema schema/ entity/event contracts type: bugfix Fixes a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants